UIMA SDK Overview

IBM’s Unstructured Information Management Architecture (UIMA) is an architecture and software framework for creating, discovering, composing and deploying a broad range of multi-modal analysis capabilities and integrating them with search technologies.

The UIMA framework provides a run-time environment in which developers can plug in and run their UIMA component implementations and with which they can build and deploy UIM applications. The framework is not specific to any IDE or platform.

The UIMA Software Development Kit (SDK) includes an all-Java implementation of the UIMA framework for the development, description, composition and deployment of UIMA components and applications. It also provides the developer with an Eclipse-based (www.eclipse.org) development environment that includes a set of tools and utilities for using UIMA.

This chapter is the intended starting point for readers who are new to the UIMA SDK. It includes this introduction and the following sections:

Chapter

Description

Overviews

UIMA SDK Overview (This Chapter)

Lists the documents provided in the UIMA SDK documentation set.

Provides a recommended path through the documentation for getting started using UIMA.

Includes release notes.

Provides a brief high-level description of the different software modules included in the UIMA SDK.

UIMA Conceptual Overview

Provides a broad conceptual overview of the UIMA component architecture making contextual references to the other documents in the UIMA SDK documentation set that provide more detail.

Setting up

UIMA Eclipse Tooling Installation and Setup

Provides step-by-step instructions for installing the UIMA SDK in the Eclipse Integrated Development Environment.

Developer's Guides

Annotator and AE Developer's Guide

Tutorial-style guide for building UIMA annotators and analysis engines. This chapter introduces the developer to creating type systems and using UIMA’s common data structure, the CAS or Common Analysis Structure. It demonstrates how to use built-in tools to specify and create basic UIMA analysis components.

CPE Developer's Guide

Tutorial-style guide for building UIMA collection processing engines. These manage the analysis of collections of documents from source to sink.

Application Developer's Guide

Tutorial-style guide for using the UIMA SDK to create, run and manage UIMA components from your application. Includes integration with the semantic search engine and a description of a simple GUI provided for submitting and running Semantic Search queries that can exploit UIMA analysis. Also describes APIs for saving and restoring the contents of a CAS using an XML format called XCAS.

Flow Controller Developer's Guide

When multiple components are combined in an Aggregate, each CAS flows among the various components. UIMA provides two built-in flows, and also allows custom flows to be implemented.

Developing Applications using Multiple Subjects of Analysis (Sofas)

A single CAS may be associated with multiple subjects of analysis (Sofas). These are useful for representing and analyzing different formats or translations of the same document. For multi-modal analysis, Sofas are good for different modal representations of the same stream (e.g., audio and closed captions). This chapter provides the developer with details on how to use multiple Sofas in an application.

CAS Multiplier Developer's Guide

A component may add additional CASes into the workflow. This may be useful to break up a large artifact into smaller units, or to create a new CAS that collects information from multiple other CASes.

XMI® and EMF Interoperability

The UIMA Type system and the contents of the CAS itself can be externalized using the XMI standard for XML MetaData. Eclipse Modeling Framework (EMF) tooling can be used to develop applications that use this information.

Tool User Guides

Component Descriptor Editor

Describes the features of the Component Descriptor Editor Tool. This tool provides a GUI for specifying the details of UIMA component descriptors, including those for Analysis Engines (primitive and aggregate), Collection Readers, CAS Consumers and Type Systems.

CPE Configurator

Describes the User Interfaces and features of the CPE Configurator tool. This tool allows the user to select and configure the components of a Collection Processing Engine and then to run the engine.

PEAR Packager

Describes how to use the PEAR Packager utility. This utility enables developers to produce an archive file for an analysis engine that includes all required resources for installing that analysis engine in another UIMA environment.

PEAR Installer

Describes how to use the PEAR Installer utility. This utility installs and verifies an analysis engine from an archive file (PEAR) with all its resources in the right place so it is ready to run.

PEAR Merger User's Guide

Merges multiple PEAR packages into one.

Document Analyzer

Describes the features of a tool for applying a UIMA analysis engine to a set of documents and viewing the results.

CAS Visual Debugger

Describes the features of a tool for viewing the detailed structure and contents of a CAS. Good for debugging.

JCasGen

Describes how to run the JCasGen utility, which automatically builds Java classes that correspond to a particular CAS Type System.

XCAS Viewer

Describes how to run the supplied viewer for XCASes, used in the examples.

References

UIMA FAQs

Frequently Asked Questions about general UIMA concepts. (Not a programming resource.)

Glossary

Main UIMA concepts and their basic definitions.

Component Descriptor Reference

Provides the detailed XML format for all the UIMA component descriptors, except the CPE (see next).

CPE Descriptor Reference

Provides detailed XML format for the Collection Processing Engine descriptor.

JavaDocs

JavaDocs detailing the UIMA SDK programming interfaces.

CAS Reference

Provides detailed description of the principal CAS interface.

JCas Reference

Provides details on the JCas, a native Java interface to the CAS.

Semantic Search Engine Reference

Describes how to write applications that query a semantic search engine index built using the UIMA SDK.

PEAR Reference

Provides detailed description of the deployable archive format for UIMA components.

XMI CAS Serialization Reference

Provides details about the XMI CAS Serialization.

  1. Explore this chapter to get an overview of the different documents that are included with the SDK.
  2. Read Chapter 2, UIMA Conceptual Overview, to get a broad view of the basic UIMA concepts and philosophy, with references to the other documents included in the SDK that provide greater detail.
  3. For more general information on the UIMA architecture and how it has been used, refer to the IBM Systems Journal special issue on Unstructured Information Management, online at http://www.research.ibm.com/journal/sj43-3.html, or to the external UIMA website, where key publications are listed: http://www.research.ibm.com/UIMA/pubs.htm.
  4. Set up the UIMA SDK in your Eclipse environment. To do this, follow the instructions in Chapter 3, UIMA SDK Setup for Eclipse.
  5. Develop sample UIMA annotators, run them and explore the results. Read Chapter 4, Annotator and Analysis Engine Developer’s Guide and follow it like a tutorial to learn how to develop your first UIMA annotator and set up and run your first UIMA analysis engines.
  6. Learn how to create, run and manage a UIMA analysis engine as part of an application. Connect your analysis engine to the provided semantic search engine to learn how a complete analysis and search application may be built with the UIMA SDK. Chapter 6, Application Developer’s Guide will guide you through this process.
  7. Pat yourself on the back. Congratulations! If you reached this step successfully, then you have an appreciation for the UIMA analysis engine architecture. You will have built a few sample annotators, deployed UIMA analysis engines to analyze a few documents, searched over the results using the built-in semantic search engine, and viewed the results through a built-in viewer, all as part of a simple but complete application.
  8. Develop and run a Collection Processing Engine (CPE) to analyze and gather the results of an entire collection of documents. Chapter 5, Collection Processing Engine Developer's Guide will guide you through this process.

  9. Learn how to package up an analysis engine for easy installation into another UIMA environment. Chapter 14, PEAR Packager and Chapter 15, PEAR Installer User's Guide will teach you how to create UIMA analysis engine archives so that you can easily share your components with a broader community.
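To give a feel for what the sample annotators in the steps above do, here is a minimal, library-free sketch of a regular-expression annotator that marks room numbers in a text. It does not use the UIMA APIs: the class names ToyAnnotation and ToyRoomNumberAnnotator, the type label "RoomNumber" and the room-number pattern are all assumptions made for illustration. A real annotator would record its results in the CAS, as described in the Annotator and Analysis Engine Developer’s Guide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** A labelled span of text; a toy stand-in for an annotation (not a UIMA class). */
final class ToyAnnotation {
    final String type;
    final int begin; // offset of the first covered character
    final int end;   // offset just past the last covered character

    ToyAnnotation(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }

    /** Returns the text this annotation covers in the given document. */
    String coveredText(String document) {
        return document.substring(begin, end);
    }
}

/** Marks room numbers with a regular expression (hypothetical pattern). */
final class ToyRoomNumberAnnotator {
    // Assumed format: two digits, a hyphen, three digits, e.g. "32-105".
    private static final Pattern ROOM = Pattern.compile("\\b\\d{2}-\\d{3}\\b");

    /** Scans the document and returns one annotation per match, in text order. */
    static List<ToyAnnotation> annotate(String document) {
        List<ToyAnnotation> found = new ArrayList<>();
        Matcher m = ROOM.matcher(document);
        while (m.find()) {
            found.add(new ToyAnnotation("RoomNumber", m.start(), m.end()));
        }
        return found;
    }
}
```

Calling ToyRoomNumberAnnotator.annotate on a sentence returns spans whose begin and end offsets point back into the original text, which is the essential idea behind annotations over a subject of analysis.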

Version 2.0 provides new capabilities and refines several areas of the UIMA architecture.

New Capabilities

New Primitive data types

UIMA now supports Boolean (bit), Byte, Short (16-bit integers), Long (64-bit integers), and Double (64-bit floating point) primitive types, and arrays of these. These types can be used like all the other primitive types.

Simpler Analysis Engines and CASes

Version 1.x made a distinction between Analysis Engines and Text Analysis Engines. This distinction has been eliminated in Version 2 - new code should just refer to Analysis Engines. Analysis Engines can operate on multiple kinds of artifacts, including text.

Version 1.x made a distinction between CASes and TCASes. TCASes are now deprecated; new code should just refer to CASes. The JCas capability, which provides a Java-friendly way to work with CAS types, remains; we clarify that the JCas is just one (of potentially several) interfaces to the CAS.
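The relationship between generic CAS access and the Java-friendly JCas view can be pictured with a small, library-free sketch. The classes ToyFeatureStructure and ToyPerson and the feature names "name" and "age" are hypothetical, and none of this is the UIMA API. The sketch only shows the idea of a typed wrapper with getters and setters layered over name-based access to the same data.

```java
import java.util.HashMap;
import java.util.Map;

/** Generic, name-based access to feature values (toy stand-in for CAS-style access). */
class ToyFeatureStructure {
    private final Map<String, Object> features = new HashMap<>();

    void setFeatureValue(String name, Object value) {
        features.put(name, value);
    }

    /** Returns the stored value, or null if the feature was never set. */
    Object getFeatureValue(String name) {
        return features.get(name);
    }
}

/** Typed view of the same data using JavaBeans-style getters and setters. */
final class ToyPerson extends ToyFeatureStructure {
    String getName() {
        return (String) getFeatureValue("name");
    }

    void setName(String name) {
        setFeatureValue("name", name);
    }

    /** Throws NullPointerException if "age" has not been set yet. */
    int getAge() {
        return (Integer) getFeatureValue("age");
    }

    void setAge(int age) {
        setFeatureValue("age", age);
    }
}
```

Both access styles read and write the same underlying values, so a value stored through setName is visible through getFeatureValue("name") and vice versa. In the SDK itself, the typed classes are generated by JCasGen rather than written by hand.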

Sofas and CAS Views simplified

The APIs for manipulating multiple subjects of analysis (Sofas) and their corresponding CAS Views have been simplified.

Analysis Component generalized to support multiple new CAS outputs

Analysis Components, in general, can make use of new capabilities to return multiple new CASes, in addition to returning the original CAS that is passed in. This allows components to have Collection Reader-like capabilities, but be placed anywhere in the flow. See the CAS Multiplier Developer's Guide.
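One use of this capability is to break a large artifact into smaller units. The following sketch illustrates that pattern without the UIMA library: a hypothetical ToySegmenter takes one text and hands out fixed-length segments one at a time through hasNext() and next(). The class name and the fixed-length splitting rule are assumptions; a real component would create new CASes as described in the CAS Multiplier Developer's Guide.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Splits one large text artifact into smaller segments, produced on demand. */
final class ToySegmenter implements Iterator<String> {
    private final String text;
    private final int segmentLength;
    private int position = 0; // offset of the next unread character

    ToySegmenter(String text, int segmentLength) {
        if (segmentLength <= 0) {
            throw new IllegalArgumentException("segmentLength must be positive");
        }
        this.text = text;
        this.segmentLength = segmentLength;
    }

    /** True while part of the input has not yet been handed out. */
    @Override
    public boolean hasNext() {
        return position < text.length();
    }

    /** Returns the next segment; the last one may be shorter than segmentLength. */
    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int end = Math.min(position + segmentLength, text.length());
        String segment = text.substring(position, end);
        position = end;
        return segment;
    }
}
```

Producing segments lazily, rather than building the whole list up front, mirrors the idea that downstream components can start work on each new unit as soon as it is available.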

User-customized flow controllers

A new component, the Flow Controller, can be supplied by the user to implement arbitrary flow control for CASes within an Aggregate. This is in addition to the two built-in flow control choices of linear and language-capability flow. See the Flow Controller Developer's Guide.
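The idea of pluggable flow control can be sketched in plain Java. In the sketch below, the interface ToyFlow and the helper ToyFlows are hypothetical and are not the UIMA Flow Controller API: a flow is simply a function that, given the index of the component that just ran, names the next one or signals completion. A linear flow is shown as one instance; a custom flow is any other implementation of the same interface.

```java
import java.util.List;
import java.util.function.UnaryOperator;

/** Chooses the next component to run (hypothetical interface, not the UIMA API). */
interface ToyFlow {
    /**
     * @param previous index of the component that just ran, or -1 at the start
     * @param componentCount number of components in the aggregate
     * @return index of the next component, or -1 when the flow is finished
     */
    int next(int previous, int componentCount);
}

final class ToyFlows {
    /** Linear flow: each component runs once, in declaration order. */
    static final ToyFlow LINEAR =
            (previous, count) -> previous + 1 < count ? previous + 1 : -1;

    /** Passes the input through the components in the order the flow dictates. */
    static String run(String input, List<UnaryOperator<String>> components, ToyFlow flow) {
        String current = input;
        int index = flow.next(-1, components.size());
        while (index >= 0) {
            current = components.get(index).apply(current);
            index = flow.next(index, components.size());
        }
        return current;
    }
}
```

Because the routing decision lives entirely in the ToyFlow implementation, the components themselves do not need to change when the order of processing changes.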

Search Engine updated with new capability to index Annotation feature values

The search engine that is provided with the UIMA SDK has been upgraded to a later release; it is more scalable and now has the ability to index additional information from Annotations. The SIAPI.pdf reference documentation for this has been updated. The SemanticSearchCasIndexer now supports indexing individual features of annotations in addition to their types.
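The difference between indexing annotation types and indexing individual feature values can be pictured with a toy inverted index. This sketch is not the semantic search engine or the SemanticSearchCasIndexer: the class ToyAnnotationIndex, its key format and its method names are assumptions chosen only to illustrate the concept.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index from annotation types and feature values to document ids. */
final class ToyAnnotationIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Indexes one annotation: once under its type, and once per feature value. */
    void add(String documentId, String type, Map<String, String> features) {
        post(type, documentId);
        for (Map.Entry<String, String> f : features.entrySet()) {
            post(featureKey(type, f.getKey(), f.getValue()), documentId);
        }
    }

    /** Documents containing at least one annotation of the given type. */
    Set<String> documentsWithType(String type) {
        return lookup(type);
    }

    /** Documents containing an annotation of the type with feature == value. */
    Set<String> documentsWithFeature(String type, String feature, String value) {
        return lookup(featureKey(type, feature, value));
    }

    // Hypothetical key format; any unambiguous encoding would do.
    private static String featureKey(String type, String feature, String value) {
        return type + "/" + feature + "=" + value;
    }

    private void post(String key, String documentId) {
        postings.computeIfAbsent(key, k -> new TreeSet<>()).add(documentId);
    }

    private Set<String> lookup(String key) {
        return postings.getOrDefault(key, Collections.<String>emptySet());
    }
}
```

With only type-level entries, a query can ask for documents that mention any Person; with feature-level entries, it can narrow that to documents where a Person annotation carries a particular feature value.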

Backwards Compatibility

For the most part, applications and components should work unchanged under version 2.0. However, please note the following non-compatible changes:

  • The format for indexes produced by the SemanticSearchCasIndexer has changed. Indexes that were generated using the v1.x SDK cannot be read with v2.0. You must reindex your content in v2.0.
  • There have been some changes to ResultSpecifications. We do not guarantee 100% backwards compatibility for applications that made use of them, although most cases should work.
  • For applications that deal with multiple subjects of analysis (Sofas), the rules that determine whether a component is Multi-View or Single-View have been made more consistent. A component is considered Multi-View if and only if it declares at least one inputSofa or outputSofa in its descriptor. This leads to the following incompatibilities in unusual cases:
    • It is an error if an annotator that implements the TextAnnotator or JTextAnnotator interface also declares inputSofas or outputSofas in its descriptor. Such annotators must be Single-View.
    • Annotators that implement GenericAnnotator but do not declare any inputSofas or outputSofas will now be passed the view of the default Sofa instead of the Base CAS.
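The Multi-View rule and the first incompatibility above can be restated as a small predicate. This is only a sketch of the rule as worded here: the class ToyViewRules and its method names are hypothetical, and the framework's actual validation is not shown in this chapter.

```java
import java.util.List;

/** Restates the Multi-View / Single-View rule (hypothetical helper, not UIMA code). */
final class ToyViewRules {
    /** Multi-View if and only if at least one input or output Sofa is declared. */
    static boolean isMultiView(List<String> inputSofas, List<String> outputSofas) {
        return !inputSofas.isEmpty() || !outputSofas.isEmpty();
    }

    /**
     * Annotators implementing TextAnnotator or JTextAnnotator must be Single-View,
     * so declaring any Sofa in their descriptor is rejected.
     */
    static void validate(boolean implementsTextAnnotator,
                         List<String> inputSofas, List<String> outputSofas) {
        if (implementsTextAnnotator && isMultiView(inputSofas, outputSofas)) {
            throw new IllegalStateException(
                    "TextAnnotator/JTextAnnotator components must be Single-View");
        }
    }
}
```

Under this rule, a component that declares no Sofas at all is always Single-View, regardless of which annotator interface it implements.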

Other changes

TextAnalysisEngine has been deprecated; it is now no different from AnalysisEngine. Existing code that uses it should continue to work, however.

Methods that were defined on the TCAS interface have been moved to the base CAS interface; the TCAS interface is no longer needed.

The DocumentAnalyzer tool saves outputs in the new XMI serialization format. The XCasAnnotationViewer and SemanticSearchGUI tools can read both the new XMI format and the previous XCAS format.

General

The UIMA SDK supports the development, discovery, composition and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies.

It includes APIs and tools for creating analysis components. Examples of analysis components include tokenizers, summarizers, categorizers, parsers, named-entity detectors etc. Tutorial examples are provided with the SDK; additional components are available from the community.

The UIMA SDK also includes a semantic search engine for indexing the results of analysis and for using this semantic index to perform more advanced search.

Programming Language Support

UIMA supports the development and integration of analysis algorithms developed in different programming languages.

The SDK is principally focussed on Java development. It also includes facilities for C++ Enablement for UIMA Components which allow UIMA components to be written in C++ and have access to a C++ version of the CAS. When used in this manner, the Java UIMA framework can incorporate analytic functions written in C++. Optional files included with the UIMA SDK describe this functionality and provide example code. See the Quick Start manual for more information on this.

Other languages, including Python, Perl, and TCL, are being added to the list.

Multi-Modal Support

The UIMA architecture supports the development, discovery, composition and deployment of multi-modal analytics, including text, audio and video. Annotations, Artifacts, and Sofas discusses this in more detail.

Availability and Open Source

The SDK is available from IBM's alphaWorks (http://www.alphaworks.ibm.com/tech/uima). The source code for the main UIMA framework is available on SourceForge (http://uima-framework.sourceforge.net).

Module

Description

UIMA Framework Core

A framework integrating core functions for creating, deploying, running and managing UIMA components, including analysis engines and Collection Processing Engines in collocated and/or distributed configurations.

The framework includes an implementation of core components for transport layer adaptation, CAS management, workflow management based on declarative specifications, resource management, configuration management, logging, and other functions.

C++ and other programming language Interoperability

Includes C++ CAS and supports the creation of UIMA compliant C++ components that can be deployed in the UIMA run-time through a built-in JNI adapter. This includes high-speed binary serialization.

Includes support for creating service-based UIMA engines outside of SDK. This is ideal for wrapping existing code written in different languages.

Externalized Framework Plug-ins

Note that interfaces of these components are available to the developer but different implementations are possible in different implementations of the UIMA framework.

CAS

These classes provide the developer with typed access to the Common Analysis Structure (CAS), including type system schema, elements, subjects of analysis and indices. The multiple subjects of analysis (Sofas) mechanism supports the independent or simultaneous analysis of multiple views of the same artifacts (e.g. documents), supporting multi-lingual and multi-modal analysis.

JCas

An alternative interface to the CAS, providing Java-based UIMA Analysis components with native Java object access to CAS types and their attributes or features, using the JavaBeans conventions of getters and setters.

Collection Processing Management (CPM)

Core functions for running UIMA collection processing engines in collocated and/or distributed configurations. The CPM provides scalability across parallel processing pipelines, check-pointing, performance monitoring and recoverability.

Resource Manager

Provides UIMA components with run-time access to external resources, and handles capabilities such as resource naming, sharing, and caching.

Configuration Manager

Provides UIMA components with run-time access to their configuration parameter settings.

Logger

Provides access to a common logging facility.

Tools and Utilities

JCasGen

Utility for generating a Java object model for CAS types from a UIMA XML type system definition.

Saving and Restoring CAS contents

APIs in the core framework support saving and restoring the contents of a CAS to streams using an XMI format.

PEAR packager for Eclipse

Tool for building a UIMA component archive to facilitate porting, registering, installing and testing components.

PEAR Installer

Tool for installing and verifying a UIMA component archive in a UIMA installation.

PEAR Merger

Utility that combines multiple PEARs into one.

Component Descriptor Editor

Eclipse Plug-in for specifying and configuring component descriptors for UIMA analysis engines as well as other UIMA component types including Collection Readers and CAS Consumers.

CPE Configurator

Graphical tool for configuring Collection Processing Engines and applying them to collections of documents.

Java Annotation viewer

Viewer for exploring annotations and related CAS data.

CAS Visual Debugger

Provides the developer with a detailed visual view of the contents of a CAS.

Document Analyzer

Graphical tool for applying analysis engines to sets of documents and viewing results.

Example Analysis Components

Semantic Search CAS Indexer

CAS Consumer that uses the semantic search engine indexer to build an index from a stream of CASes. Requires the semantic search engine (included).

Database Writer

CAS Consumer that writes the content of selected CAS types into a relational database, using JDBC. This code is in doc/examples/src/com/ibm/uima/examples/cpe/PersonTitleDBWriterCasConsumer.

Annotators

Set of simple annotators meant for pedagogical purposes. Includes: Date/time, Room-number, Regular expression, Tokenizer, and Meeting-finder annotator. There are also sample Annotators in C++ and Python. There are sample CAS Multipliers as well.

Flow Controllers

There is a sample flow-controller based on the whiteboard concept of sending the CAS to whatever annotator hasn't yet processed it, when that annotator's inputs are available in the CAS.

File System Collection Reader

Simple Collection Reader for pulling documents from the file system and initializing CASes.

XMI Collection Reader, CAS Consumer

Reads and writes the CAS in XMI format.

Search Components

Semantic Search Engine

Search Engine that supports searching over results of analysis including annotations and nested annotations using the "XML Fragment" query language.

Components not currently available in this release of the UIMA SDK.

If you are interested in these extensions, please contact the UIMA team at the IBM T.J. Watson Research Center via www.ibm.com/research/uima.

Semantic search and Analysis Workbench (SAW)

Graphical User Interface for applying analysis to build search indices and DBs and query interfaces for searching/exploring analysis results. Uses the semantic search engine and the EKDB (see below).

Extracted Knowledge Database (EKDB)

Database schema and APIs for creating and populating a relational database with analysis results including entity and relation annotations. Includes a CAS Consumer that populates the database. Semantic Analysis Workbench provides a front-end to this database and to the Semantic Search Engine’s query processor.

UIMA SDK Capabilities